Machine learning for Asian language text classification
Internal identifier: 000495 (Main/Exploration); previous: 000494; next: 000496
Authors: Fuchun Peng [United States]; Xiangji Huang [Canada]
Source:
- Journal of Documentation [ 0022-0418 ] ; 2007-05-01.
English descriptors
- Teeft :
- Algorithm, Asian language, Asian language text, Bayes, Bayes model, Bayes result, Best compression, Best result, Binary problems, Byte level models, Categorization, Certain level, Character level, Character level features, Chinese data, Chinese data table, Chinese experiments, Chinese text, Chinese word segmentation, Class label, Computational linguistics, Entropy, Exponential form, Feature engineering, Feature selection, Feature space, Higher order models, Ieee transactions, Individual scores, Information retrieval, International conference, Japanese data, Japanese text, Jdoc, Language model, Language modeling, Language modeling approach, Language models, Large number, Markov independence assumption, Maximum entropy, Maximum entropy distribution, Modeling, Multinomial model, Mutual information, Natural language, Natural language processing, Negative examples, Overall accuracy, Peng, Perplexity, Retrieval, Segmentation, Segmentation accuracy, Segmentation performance, Sparse data problems, Standard approaches, Standard techniques, Standard text, Statistical language modeling, Support vector machine, Support vector machines, Support vectors, Table viii, Test corpus, Test document, Text categorization, Text categorization problems, Text retrieval, Training data, Training examples, Uncommon features, Word counts, Word level, Word level features, Word segmentation, Word segmentation accuracies, Word segmentation accuracy, Word segmentation information, Word sequences.
Abstract
Purpose: The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, such as Chinese and Japanese, where no word boundary information is available in written text. The paper advocates a simple language modeling based approach for this task. Design/methodology/approach: Naïve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text. A segmentation-based approach was compared with the non-segmentation-based approach. Findings: There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually stops improving, and can in fact even decrease, after a certain level of segmentation accuracy. Practical implications: Applying the findings to real web text classification is ongoing work. Originality/value: The paper is very relevant to Chinese and Japanese information processing, e.g. web page classification, web search.
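The language modeling approach the abstract advocates can be illustrated with a minimal sketch: train one character-level n-gram language model per class and assign a document to the class whose model gives it the highest log-likelihood (equivalently, the lowest cross-entropy). This is not the authors' implementation — the class names, the bigram order, and the add-one smoothing are illustrative assumptions; the paper's models and smoothing may differ.

```python
from collections import defaultdict
import math

class CharNgramClassifier:
    """Character-level n-gram language model classifier (illustrative sketch):
    one LM per class; a document is assigned to the class whose LM gives it
    the highest log-likelihood."""

    def __init__(self, n=2):
        self.n = n
        self.ngrams = {}    # class -> context -> char -> count
        self.contexts = {}  # class -> context -> total count
        self.vocab = set()  # character vocabulary shared across classes

    def train(self, docs_by_class):
        for label, docs in docs_by_class.items():
            grams = defaultdict(lambda: defaultdict(int))
            totals = defaultdict(int)
            for doc in docs:
                padded = "^" * (self.n - 1) + doc  # "^" marks document start
                for i in range(self.n - 1, len(padded)):
                    ctx, ch = padded[i - self.n + 1:i], padded[i]
                    grams[ctx][ch] += 1
                    totals[ctx] += 1
                    self.vocab.add(ch)
            self.ngrams[label] = grams
            self.contexts[label] = totals

    def log_prob(self, label, doc):
        # Add-one (Laplace) smoothing over the shared character vocabulary,
        # so unseen n-grams get a small non-zero probability.
        v = len(self.vocab) + 1
        padded = "^" * (self.n - 1) + doc
        lp = 0.0
        for i in range(self.n - 1, len(padded)):
            ctx, ch = padded[i - self.n + 1:i], padded[i]
            count = self.ngrams[label][ctx][ch]   # 0 if unseen
            total = self.contexts[label][ctx]     # 0 if context unseen
            lp += math.log((count + 1) / (total + v))
        return lp

    def classify(self, doc):
        return max(self.ngrams, key=lambda label: self.log_prob(label, doc))
```

Because the model works directly on characters, no word segmentation step is required — which is the practical appeal of the non-segmentation-based approach for Chinese and Japanese, where word boundaries are not marked in text.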
DOI: 10.1108/00220410710743306
Links toward previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 001D10
- to stream Istex, to step Curation: 001701
- to stream Istex, to step Checkpoint: 000447
- to stream Main, to step Merge: 000498
- to stream Main, to step Curation: 000495
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Machine learning for Asian language text classification</title>
<author><name sortKey="Peng, Fuchun" sort="Peng, Fuchun" uniqKey="Peng F" first="Fuchun" last="Peng">Fuchun Peng</name>
</author>
<author><name sortKey="Huang, Xiangji" sort="Huang, Xiangji" uniqKey="Huang X" first="Xiangji" last="Huang">Xiangji Huang</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:6EF70F3DF40FE6C3EF63BBF334B5EDF3E88053F9</idno>
<date when="2007" year="2007">2007</date>
<idno type="doi">10.1108/00220410710743306</idno>
<idno type="url">https://api.istex.fr/ark:/67375/4W2-6TSJ99BC-N/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001D10</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">001D10</idno>
<idno type="wicri:Area/Istex/Curation">001701</idno>
<idno type="wicri:Area/Istex/Checkpoint">000447</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000447</idno>
<idno type="wicri:doubleKey">0022-0418:2007:Peng F:machine:learning:for</idno>
<idno type="wicri:Area/Main/Merge">000498</idno>
<idno type="wicri:Area/Main/Curation">000495</idno>
<idno type="wicri:Area/Main/Exploration">000495</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Machine learning for Asian language text classification</title>
<author><name sortKey="Peng, Fuchun" sort="Peng, Fuchun" uniqKey="Peng F" first="Fuchun" last="Peng">Fuchun Peng</name>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Yahoo Inc., Sunnyvale, California</wicri:regionArea>
<placeName><region type="state">Californie</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Huang, Xiangji" sort="Huang, Xiangji" uniqKey="Huang X" first="Xiangji" last="Huang">Xiangji Huang</name>
<affiliation wicri:level="1"><country xml:lang="fr">Canada</country>
<wicri:regionArea>School of Information Technology, York University, Toronto</wicri:regionArea>
<wicri:noRegion>Toronto</wicri:noRegion>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Journal of Documentation</title>
<idno type="ISSN">0022-0418</idno>
<imprint><publisher>Emerald Group Publishing Limited</publisher>
<date type="published" when="2007-05-01">2007-05-01</date>
<biblScope unit="volume">63</biblScope>
<biblScope unit="issue">3</biblScope>
<biblScope unit="page" from="378">378</biblScope>
<biblScope unit="page" to="397">397</biblScope>
</imprint>
<idno type="ISSN">0022-0418</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0022-0418</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="Teeft" xml:lang="en"><term>Algorithm</term>
<term>Asian language</term>
<term>Asian language text</term>
<term>Bayes</term>
<term>Bayes model</term>
<term>Bayes result</term>
<term>Best compression</term>
<term>Best result</term>
<term>Binary problems</term>
<term>Byte level models</term>
<term>Categorization</term>
<term>Certain level</term>
<term>Character level</term>
<term>Character level features</term>
<term>Chinese data</term>
<term>Chinese data table</term>
<term>Chinese experiments</term>
<term>Chinese text</term>
<term>Chinese word segmentation</term>
<term>Class label</term>
<term>Computational linguistics</term>
<term>Entropy</term>
<term>Exponential form</term>
<term>Feature engineering</term>
<term>Feature selection</term>
<term>Feature space</term>
<term>Higher order models</term>
<term>Ieee transactions</term>
<term>Individual scores</term>
<term>Information retrieval</term>
<term>International conference</term>
<term>Japanese data</term>
<term>Japanese text</term>
<term>Jdoc</term>
<term>Language model</term>
<term>Language modeling</term>
<term>Language modeling approach</term>
<term>Language models</term>
<term>Large number</term>
<term>Markov independence assumption</term>
<term>Maximum entropy</term>
<term>Maximum entropy distribution</term>
<term>Modeling</term>
<term>Multinomial model</term>
<term>Mutual information</term>
<term>Natural language</term>
<term>Natural language processing</term>
<term>Negative examples</term>
<term>Overall accuracy</term>
<term>Peng</term>
<term>Perplexity</term>
<term>Retrieval</term>
<term>Segmentation</term>
<term>Segmentation accuracy</term>
<term>Segmentation performance</term>
<term>Sparse data problems</term>
<term>Standard approaches</term>
<term>Standard techniques</term>
<term>Standard text</term>
<term>Statistical language modeling</term>
<term>Support vector machine</term>
<term>Support vector machines</term>
<term>Support vectors</term>
<term>Table viii</term>
<term>Test corpus</term>
<term>Test document</term>
<term>Text categorization</term>
<term>Text categorization problems</term>
<term>Text retrieval</term>
<term>Training data</term>
<term>Training examples</term>
<term>Uncommon features</term>
<term>Word counts</term>
<term>Word level</term>
<term>Word level features</term>
<term>Word segmentation</term>
<term>Word segmentation accuracies</term>
<term>Word segmentation accuracy</term>
<term>Word segmentation information</term>
<term>Word sequences</term>
</keywords>
</textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract">Purpose: The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, such as Chinese and Japanese, where no word boundary information is available in written text. The paper advocates a simple language modeling based approach for this task. Design/methodology/approach: Naïve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text. A segmentation-based approach was compared with the non-segmentation-based approach. Findings: There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually stops improving, and can in fact even decrease, after a certain level of segmentation accuracy. Practical implications: Applying the findings to real web text classification is ongoing work. Originality/value: The paper is very relevant to Chinese and Japanese information processing, e.g. web page classification, web search.</div>
</front>
</TEI>
<affiliations><list><country><li>Canada</li>
<li>États-Unis</li>
</country>
<region><li>Californie</li>
</region>
</list>
<tree><country name="États-Unis"><region name="Californie"><name sortKey="Peng, Fuchun" sort="Peng, Fuchun" uniqKey="Peng F" first="Fuchun" last="Peng">Fuchun Peng</name>
</region>
</country>
<country name="Canada"><noRegion><name sortKey="Huang, Xiangji" sort="Huang, Xiangji" uniqKey="Huang X" first="Xiangji" last="Huang">Xiangji Huang</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Informatique/explor/SgmlV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000495 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000495 | SxmlIndent | more
To link to this page from the Wicri network
{{Explor lien |wiki= Wicri/Informatique |area= SgmlV1 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:6EF70F3DF40FE6C3EF63BBF334B5EDF3E88053F9 |texte= Machine learning for Asian language text classification }}
This area was generated with Dilib version V0.6.33.